Super-Scalable Algorithms for Computing on 100,000 Processors

Authors

  • Christian Engelmann
  • Al Geist
Abstract

In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM Blue Gene/L can have up to 128,000 processors, and delivery of the first system is scheduled for 2005. Existing deficiencies in the scalability and fault tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude while efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes, and traditional checkpointing may no longer be feasible. In this paper, we summarize our recent research on super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or where other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation.
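
The two scaling arguments in the abstract can be illustrated with a little arithmetic. The sketch below is only a rough illustration and is not taken from the paper: the function names, the per-processor peak rate, the efficiencies, and the node MTBF are all hypothetical values chosen for the example, and the interrupt estimate assumes independent node failures with a constant failure rate.

```python
import math


def effective_performance(processors, peak_per_processor, efficiency):
    """Delivered performance = processors * per-processor peak * parallel efficiency."""
    return processors * peak_per_processor * efficiency


def system_mtti_hours(node_mtbf_hours, processors):
    """Rough system mean time to interrupt.

    Assumes independent node failures with a constant rate, so the
    system-wide failure rate is the sum of the per-node rates.
    """
    return node_mtbf_hours / processors


# Hypothetical figures: 1 GFLOPS peak per processor, efficiency 50% vs. 5%.
small = effective_performance(10_000, 1.0e9, 0.5)
large = effective_performance(100_000, 1.0e9, 0.05)

# Ten times the processors at one tenth the efficiency delivers the same rate.
print("same delivered performance:", math.isclose(small, large))

# Hypothetical node MTBF of 5,000 hours spread over 100,000 processors.
mtti_minutes = system_mtti_hours(5_000.0, 100_000) * 60.0
print(f"estimated system MTTI: {mtti_minutes:.1f} minutes")
```

Under these assumed numbers the larger machine delivers no more useful work than the smaller one, and the estimated time between interrupts falls to a few minutes, which is the regime the abstract describes.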

Similar resources

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...

Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors

This paper describes ongoing research at Oak Ridge National Laboratory into the issues and potential problems of algorithm scalability to 100,000 processor systems. Such massively parallel computers are projected to be needed to reach a petaflops computational speed before 2010. And to make such hypothetical machines a reality, IBM Research has begun developing a computer named “BlueGene” that ...

Scalable Heuristic Algorithms for the Parallel Execution of Data Flow Acyclic Digraphs

Data flow acyclic directed graphs (digraphs) can be applied to accurately describe the data dependency for a wide range of grid-based scientific computing applications ranging from numerical algebra to realistic applications of radiation or neutron transport. The parallel computing of these applications is equivalent to the parallel execution of digraphs. This paper presents a framework of scal...

Data Replication-Based Scheduling in Cloud Computing Environment

High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

An Optimal Utilization of Cloud Resources using Adaptive Back Propagation Neural Network and Multi-Level Priority Queue Scheduling

With innovation in the cloud computing industry, many services have been provided based on different deployment criteria. Nowadays everyone tries to remain connected and demands maximum utilization of resources with minimum time and effort. This makes optimum utilization of resources an important challenge in cloud computing. To overcome this issue, many techniques have been proposed ...

Publication date: 2005